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Abstract 


This  paper  describes  the  derivation  of  an  empirically  efficient  parallel  tv/o-dimensional  Delaunay 
triangulation  program  from  a  theoretically  efficient  CREW  PRAM  algorithm.  Compared  to  pre~ 
vknis  work,  the  resulting  implementation  is  not  limited  to  datasets  with  a  uniform  distribution  of 
points,  achieves  significantly  better  speediips  over  good  serial  code,  and  is  widely  portable  due  to 
its  use  of  MPI  as  a  communication  mechanism.  Results  are  presented  for  a  loosely-cjoiipled  cluster 
of  workstations,  two  distributed-memory  multicomputers,  and  a  shared-memory  multiprocessor. 
The  Macffiavelli  toolkit  used  to  transform  the  nested  data  parallelism  inherent  in  the  divide-and- 
conquer  algorithm  into  achievable  task  and  data  parallelism  is  also  described  and  compared  to 
previous  techniques. 
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Introduction 


Delaunay  triangulation  represents  an  important  substep  in  many  computationally-intensive  ap¬ 
plications,  including  pattern  recognition,  terrain  modelling,  and  mesh  generation  for  the  solution 
of  partial  differential  equations.  Delaunay  triangulations  and  their  duals,  Voronoi  diagrams,  are 
among  the  most  widely-studied  structures  in  computational  geometry.  Voronoi  diagrams  have 
also  appeared  in  many  other  fields  under  different  names  [2];  domains  of  action  in  crystallogra¬ 
phy,  Wigner-SeAtz  zones  in  metallingy,  Thiessen  polygons  in  geography,  and  Blum’s  transforms 
in  biology.  This  paper  assumes  that  the  reader  is  familiar  with  the  basic  definition  of  Delaunay 
triangulation  in  namely  the  unique  triangulation  of  a  set  S  of  points  such  that  there  are  no 
elements  of  S  within  the  circumcircle  of  any  triangle. 

There  are  many  well-known  serial  algorithms  for  Delaunay  triangulation.  The  best  have  been 
extensively  analyzed  [17,  36],  and  implemented  as  general-purpose  libraries  [4,  33].  Since  these 
algorithms  are  time  and  memory  intensive,  parallel  implementations  are  important  both  for  im¬ 
proved  performan(*e  and  to  allow  the  solution  of  problems  that  are  too  large  for  serial  machines. 
However,  although  several  parallel  algorithms  for  Delaunay  triangulation  have  been  described  [1, 
32,  13,  27,  20],  practical  implementations  have  been  slower  to  appear.  One  reason  is  that  the 
dynamics  nature  of  the  problem  can  result  in  significant  inter-processor  communication.  Perform¬ 
ing  key  phases  of  the  algorithm  on  a  single  processor  (for  example,  serializing  the  merge  step  of 
a  divide- and-conquer  algorithm,  as  in  [38])  reduces  this  communication,  but  introduces  a  serial 
bottleneck  that  severely  limits  scalability  in  terms  of  both  parallel  speedup  and  achievable  problem 
size.  The  use  of  decomposition  techniques  such  as  biuJieting  [28,  11,  37,  35],  or  striping  [14]  can 
also  reduce  communication,  but  relies  on  the  input  dataset  having  a  uniform  spatial  distribution  of 
points  in  order  to  avoid  load  imbalances  between  processors.  Unfortunately,  while  most  real-world 
problems  are  not  this  uniform,  few  authors  report  the  performance  of  their  implementations  on 
non-uniform  datasets.  Of  those  that  do,  the  3D  algorithm  by  Teng  et  al  [37]  was  up  to  5  times 
slower  on  non-uniform  datasets  than  on  uniform  ones  (on  a  32-processor  CM-5),  while  the  3D  al¬ 
gorithm  by  Cignoni  et  al  [11]  was  up  to  10  times  slower  on  non-imiform  datasets  than  on  uniform 
ones  (on  a  128-processor  nCUBE).  The  2D  algorithm  by  Ding  and  Densham  [15]  is  designed  to  be 
able  to  handle  non-uniform  datasets,  but  has  only  been  demonstrated  to  scale  to  2  processors. 

A  second  problem  is  that  the  parallel  algorithms  are  typically  much  more  complex  than  their 
serial  counterparts.  This  added  complexity  results  in  low  parallel  efficiency:  that  is,  they  achieve 
only  a  small  fraction  of  the  perfect  speedup  over  good  serial  code  running  on  one  processor.  Again, 
direct  comparison  is  difficult  becamse  few  authors  quote  speedups  over  good  serial  code.  Of  those 
that  do,  the  2D  algorithm  by  Su  achieved  speedup  factors  of  3.5-5. 5  on  a  32-processor  KSR-1  [35], 
for  a  parallel  efficiency  of  11-17%,  while  the  3D  algorithm  [28]  by  Merriam  achieved  speedup  factors 
of  6-20  on  a  128-processor  Intel  Gamma,  for  a  parallel  efficiency  of  5-16%.  Both  of  these  results 
were  for  uniform  datasets.  The  2D  algorithm  by  Chew  et  al  [10]  (which  solves  the  more  general 
problem  of  constrained  Delaimay  triangulation  in  a  meshing  algorithm)  achieves  speedup  factors  of 
3  on  an  8-processor  SP2,  but  currently  requires  that  the  boundaries  between  processors  be  created 
by  hand. 

Blelloch,  Miller  and  Talmor  recently  developed  a  CREW  PRAM  algorithm  that  does  not  rely 
on  bucketing  and  hence  can  efficiently  handle  non-uniform  datasets  [8].  It  is  divide-and-conquer 
in  style  but  uses  a  “marriage  before  conquest”  approach  to  eliminate  the  expensive  merge  step 
that  has  hindered  previous  parallel  algorithms.  Additionally,  when  prototyped  in  the  nested  data- 
parallel  language  Nesl  [7],  the  algorithm  was  found  to  perform  only  twice  as  many  floating-point 
operations  as  a  good  serial  algorithm. 
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Figure  1:  Nested  recursion  in  Delaunay  triangulation  algorithm  by  Blelloch  et  al  [8].  Each  recursive 
level  of  the  outer  divide- and-conquer  triangulation  algorithm,  which  has  a  perfect  split,  uses  as 
a  substep  a  divide-and-conquer  convex  hull  algorithm  which  may  have  a  much  more  irregular 
structure. 

This  paper  describes  a  practi(’al  parallel  Delaunay  triangulation  program  which  uses  the  algo¬ 
rithm  by  Blelloch  et  al  as  a  coarse  parallel  partitioner,  switching  to  an  efficient  implementation  of 
Dwyer’s  serial  algorithm  provided  by  the  Triangle  package  [33]  at  the  leaves  of  the  recursion  tree. 
The  program  was  parallelized  using  the  Machiavelli  toolkit  [24],  which  has  been  designed  both  for 
the  direct  implementation  of  parallel  divide-and-conquer  algorithms  (as  in  this  case),  and  as  an 
implementation  layer  for  nested  data-parallel  languages.  It  is  particularly  well-suited  to  exploit¬ 
ing  the  nested  divide-and-conquer  nature  of  the  algorithm  by  Blelloch  et  al,  which  iLses  an  inner 
quickhull  algorithm  as  a  substep,  as  shown  in  Figure  1. 

Since  Machiavelli  uses  MPI  [18]  as  a  (communication  mechanism,  the  resulting  Delaunay  trian¬ 
gulation  program  is  more  portable  than  most  previous  implementations,  which  have  used  vendor- 
speccffic  message-passing  libraries  [14,  28,  11,  37]  or  shared  memory  direcctives  [35].  We  present 
results  for  a  loosely-coupled  DEC  AlphaCluster,  a  distributed-memory  IBM  SP2,  a  distributed- 
memory  Cray  T3D,  and  a  shared-memory  SGI  Power  Challenge.  Experimentally,  the  program 
achieves  three  times  the  parallel  efficiency  of  previous  implementations,  and  is  at  most  1.5  times 
slower  on  non-uniform  datasets  than  on  uniform  ones. 

The  rest  of  the  paper  is  arranged  as  follows.  Section  2  describes  the  Machiavelli  toolkit,  and 
Section  3  outlines  the  parallel  algorithm  by  Blelloch  et  al.  Section  4  disccusses  key  implementation 
decisions.  Section  5  presents  and  analyzes  performance  results  for  a  range  of  input  distributions. 
Finally,  Secction  6  cconciudes  the  paper. 


2  The  Machiavelli  Toolkit 

Many  of  the  most  efficient  and  widely  msed  computer  algorithms  use  a  divide-and-conquer  ap¬ 
proach  to  solving  a  problem.  Examples  inciude  binary  search,  quicksort  [26],  and  fast  matrix 
multiplication  [34].  The  subproblems  that  are  generated  can  typically  be  solved  independently,  and 
hence  divide-and-conquer  algorithms  have  long  been  recognized  as  presenting  a  potential  somce 
of  parallelism.  This  has  resulted  in  many  architect, ures  and  parallel  programming  languages  being 
designed  specifically  for  the  implementation  of  divide-and-conquer  algorithms  (see  [3]  for  a  survey). 
However,  previous  parallel  divide-and-conquer  models  have  typically  been  Umited  to  regulo,r  algo¬ 
rithms,  in  which  the  sub  problems  are  of  equal  size.  This  excludes  a  very  iLseful  class  of  algorithms; 
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for  example,  quicksort,  selection,  and  many  computational  geometry  algorithms  all  have  an  irreg¬ 
ular  divide-and-conquer  structure.  Conco^tenoied  paroMedism  is  a  notable  recent  exception  that  can 
handle  irregular  divide-and-conquer  algorithms,  but  can  only  outperform  a  task-parallel  approach 
when  the  communication  cost  of  redistributing  the  data  is  significant  compared  to  the  computa¬ 
tional  cost  of  subdividing  the  task  [19].  The  alternative  approach  of  using  a  more  general  language 
model  to  handle  irregular  algorithms  runs  the  risk  of  hiding  divide-and-conquer  parallelism  that 
would  otherwise  be  easy  to  exploit.  For  example,  although  nested  data-parallel  languages  such  as 
Nesl  [7]  and  ProteiLS  [31]  are  well-suited  for  expressing  irregular  divide-and-conquer  algorithms, 
their  current  implementation  layer  assumes  a  vector  PRAM  model  [6].  This  can  be  efficiently  im¬ 
plemented  on  vector  processors  with  high  memory  bandwidth,  but  it  is  harder  to  do  so  on  current 
RISC-based  NUMA  multiprocessor  architectures,  due  to  the  higher  relative  costs  of  communication 
and  poor  data  locality  [21]. 

Ma(*hiavelli  [24]  is  a  new  parallel  toolkit  for  divide-and-conquer  algorithms  that  is  intended  to 
alleviate  some  of  these  problems.  It  is  designed  to  be  usable  both  as  an  implementation  layer  for 
languages  such  as  Nesl,  and  as  a  programmer’s  toolkit  for  the  direct  implementation  of  efficient 
parallel  programs.  There  were  three  main  goals  in  Mac'hiavelli’s  development.  First,  it  must  be 
capable  of  handling  divide-and-conquer  algorithms  that  are  both  irregular  and  nested]  that  is, 
algorithms  that  use  a  different  divide-and-conquer  algorithm  as  a  substep.  Secxmd,  it  miLst  be 
portable  across  parallel  architectures.  Finally,  it  must  be  efficient;  that  is,  it  must  result  in  parallel 
programs  that  show  significant  speedup  over  good  serial  programs. 

To  a(ihieve  the  first  goal,  Machiavelli  uses  recursive  subdivision  of  asynchronous  teams  of  pro¬ 
cessors  running  SPMD  code  to  directly  implement  the  behavior  of  a  divide-and-conquer  algorithm 
(this  can  be  seen  a  generalized  version  of  the  technique  used  by  Barnard’s  spectral  bisection  algo¬ 
rithm  on  the  Cray  T3D  [5]).  To  achieve  the  second  goal,  Machiavelli  is  implemented  using  C  and 
MPI  (the  standard  Message  Passing  Interface  [18]).  To  achieve  the  third  goal,  Machiavelli  obtains 
parallelism  from  both  data-parallel  operations  within  teams  and  from  the  task-parallel  invocation 
of  recursive  functions  on  different  teams,  and  mses  good  serial  code  where  possible. 

The  Machiavelli  toolkit  currently  consists  of  a  library  of  vector  primitives  and  a  small  run-time 
system.  The  library  implements  the  basic  communication  and  team  functions  in  terms  of  MPI;  as 
an  example.  Figure  2  shows  source  code  for  the  team-splitting  phase  of  the  Delaunay  triangulation 
program.  The  run-time  system  adds  the  ability  to  perform  dynamic  load-balancing  for  irregular 
algorithms.  Specifically,  it  can  ship  a  recursive  serial  function  call  to  an  idle  processor  in  order  to 
redistribute  computation  [22]. 

A  Machiavellian  divide-and-conquer  program  consists  of  both  serial  and  SPMD  parallel  code. 
The  parallel  code  operates  at  the  upper  levels  of  recursion,  and  uses  calls  to  the  library  to  redis¬ 
tribute  data  and  subdivide  the  problem.  Initially,  data  is  distributed  in  block  fashion  across  all  the 
processors.  The  processors  act  as  a  data-parallel  team,  synchronizing  with  each  other  only  when 
necessary  to  exchange  data.  At  the  first  recursive  call,  the  initial  team  splits  into  two  new  teams, 
with  the  relative  number  of  processors  in  each  new  team  being  chosen  to  approximate  the  relative 
cost  of  each  of  the  sub-problems.  The  teams  repeat  the  process,  recnirsing  down  to  smaller  and 
smaller  teams,  which  run  asyncihronoiisly  with  respect  to  each  another.  When  a  team  contains  a 
single  procjessor,  that  processor  switches  to  serial  code.  Note  that  this  can  be  a  more  efficient  serial 
algorithm,  rather  than  just  a  serial  translation  of  the  parallel  algorithm.  When  the  serial  code 
finishes,  parallel  teams  are  reformed  and  results  are  combined  on  the  way  bach  up  the  recnusion 
call  tree. 

Within  the  Machiavelli  toolkit,  data  is  expressed  as  private  scalars  and  distributed  vectors.  For 
c-ommunication,  teams  are  mapped  to  MPI  groups,  and  user-defined  types  are  mapped  to  MPI’s 
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vec^point  parallel.DT  (team  T,  vec_point  P,  vec^border  B) 

{ 

team  T^new; 

vec.point  P_L,  P^R,  P_new,  result; 
vec.border  B_L,  B„R,  B_new; 

...  /*  Compute  P_L,  P„R,  B_L,  B_R  */  ... 

/*  Create  two  new  teams  according  to  ratio  of 
*  subproblem  sizes,  join  one  of  them  ♦/ 

T_new  =  split _teams  (T,  P„L->len,  P_R“>len) ; 

/*  Split  P_L/P_R  cOid  B^L/B^R  between  the  teams  */ 

P_new  =  split_vec_point  (P„L,  P.R,  T,  T_new) ; 

B_new  =  split_vec_border  (B_L,  B_R,  T,  T_new) ; 
free  (P_L) ;  free  (P_R) ;  free  (B_L) ;  free  (B3) ; 

/*  Is  my  new  team  just  a  single  processor?  */ 
if  (T_new“>nproc  ==  1)  { 

/*  Then  run  serial  Delaunay  triangulation  ♦/ 
result  =  serial_DT  (P_new,  B^new) ; 

}  else  { 

/*  Else  recurse  in  parallel  code  in  new  team  */ 
result  =  parallel.DT  (T_new,  P.new,  B_new) ; 

} 

/*  Teams  rejoin  here  */ 
return  result; 


Figure  2:  Team-splitting  in  Delaunay  triangulation  program  (see  Figure  3),  expressed  as  SPMD  C 
oode  with  calls  to  Machiavelli  library. 


derived  datatypes  [18].  A  purely  data-parallel  operation  (such  as  an  elementwise  mathematical 
fimction  on  a  vector)  is  implemented  as  a  loop  over  the  appropriate  section  of  data  on  each  processor. 
More  complex  functions,  such  as  finding  the  largest  element  in  a  vector,  involve  both  a  local  loop 
and  an  MPI  reduction  operation  within  the  current  team.  Finally,  vector  communication  functions 
are  implemented  using  MPI’s  all-to-all  {X)mmunication  primitives.  Optimized  library  fimctioiLS  are 
provided  for  many  of  the  idionas  that  occur  in  divide-and-conquer  algorithnas,  such  as  sending 
subsets  of  a  vector  to  sub-teams,  and  merging  data  from  sub-teams  into  a  single  vector. 


3  Delaunay  Triangulation  Algorithm 

This  section  briefly  outlines  use  of  the  parallel  Delaunay  triangulation  algorithm  by  BlelkxJi, 
Miller,  and  Talmor  as  a  coarse  partitioner.  The  algorithm  uses  the  well-known  reduction  of 
two-dimensional  Delaunay  triangulation  to  the  problem  of  finding  the  three-dimensional  convex 
hull  of  points  on  a  sphere  or  paraboloid.  The  resulting  algorithm  is  divide-and-conquer  in  na¬ 
ture  but  uses  a  “marriage  before  conquest”  approach,  similar  to  the  DeWall  triangulation  al¬ 
gorithm  [12],  which  enables  it  to  avoid  an  expensive  merge  step.  See  [8]  for  more  details,  and 
http :  /  /web .  scandal .  cs .  emu .  edu/cgi -bin/demo  for  an  interactive  demonstration. 
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Algorithm;  ParDel(P,  jB,  T) 

Input:  P,  a  set  of  points  in  P,  a  set  of  Delaunay  edges  of  P  which  is  the  border 
of  a  region  in  containing  P,  and  T,  a  team  of  processors. 

Output:  The  set  of  Delaimay  triangles  of  P  which  are  contained  within  B, 

Method: 

1.  If  \T\  —  1,  return  Serial(P,P). 

2.  Find  the  point  q  that  is  the  median  along  the  x  axis  of  all  internal  points  (that 
is,  points  in  P  that  are  not  on  the  boundary  B).  Let  £  be  the  hne  x  ~  q^. 

3.  Let  P'  =  {(j)y  -  qy,  ||p-^?|P)  |  ipx.Py)  ^  derived  from  projecting  the  points 
P  onto  a  3Z>  paraboloid  centered  at  q,  and  then  onto  the  vertical  plane  through 
the  hne  £. 

4.  Let  H  =  L0WER._C0NVEX_Hull(P').  "H  is  a  path  of  Delaimay  edges  of  the  set 
P.  Let  P'u  be  the  set  of  points  on  the  path  and  %  be  the  path  H  traversed 
in  the  opposite  direction. 

5.  Create  the  input  for  left  and  right  subproblems: 

B0RDER-Merge{P,  U) 

Border_Merge(P,  11) 

{p  G  P  I  p  is  left  of  £}  U  {p'  €  Pt^  |  p'  contributed  to  B^} 

{p  G  P  I  p  is  right  of  £}  U  {p'  6  P?^  |  p'  contributed  to  B^} 
subset  of  T  of  size  |T||P^|/(1P^|  4-  |P^1) 

r£i  _  rph 

6.  Return  ParDel(P^,P^,T^)  U  ParDel(P^, 

Figure  3:  Recursive  divide-and-conquer  SPMD  pseudocode  for  Delaimay  triangulation,  using  the 
algorithm  by  Blelloch  et  al  [8]  as  a  coarse  partitioner  (a  correction  has  been  made  to  step  5). 
Although  the  algorithm  uses  alternating  x  and  y  cuts,  for  simplicity  only  the  x  cut  is  shown.  The 
three  subroutines  SERIAL,  Lower«Convex_Hull,  and  Border_Merge  are  described  in  the  text. 


B^  = 
P^  = 

pR  _ 
rph  ^ 

fJ'R  = 


Pseudocode  for  the  algorithm  is  shown  in  Figure  3.  It  has  three  important  substeps: 

Serial  Delaunay:  Although  any  serial  Delaunay  triangulation  algorithm  can  be  used  for  the  base 
case,  Dwyer’s  [16]  is  recommended  since  it  has  been  shown  experimentally  to  be  the  fastest  [36, 
33]. 

Lower  convex  hull:  The  lower  half  of  the  convex  huU  of  the  projected  points  is  used  to  find  a 
new  path  1-i  that  divides  the  problem  into  two  smaller  problems.  Since  the  projection  is  based 
on  the  median  point,  the  division  is  perfect,  as  shown  in  Figure  1. 

Border  merge:  This  routine  takes  the  old  border  B  and  merges  it  with  the  newly-foimd  dividing 
path  H  to  form  a  new  border  for  a  recursive  call.  The  new  border  is  computed  based  on  an 
inspection  of  the  possible  intersections  of  points  in  B  and 
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3.1  Theoretical  Performance 


The  full  Delaimay  triangnlation  algorithm  by  Blelloch  et  al  requires  0{nlogn)  parallel  work  and 
0(log''^n)  parallel  depth  on  a  CREW  PRAM  (i.e.,  a  total  time  of  0((n  log  n)/P  +log''^n)  on  P 
processors).  The  work  complexity  is  optimal  for  Delaunay  triangulation,  and  the  depth  complexity 
is  practical  for  parallelization  purposes. 

Note  that  these  complexities  assume  that  the  lower  convex  hull  substep  is  solved  using  a  linear- 
work  algorithm,  which  is  possible  since  we  can  store  the  points  in  sorted  order  [29].  However, 
Blelloch  et  al  foimd  experimentally  that  a  simple  quickhull  [30]  was  faster  than  a  more  complicated 
convex  hull  algorithm  that  was  guaranteed  to  take  linear  time.  Furthermore,  using  a  point-pruning 
version  of  quickhull  that  limits  possible  imbalances  between  recursive  calls  [9]  reduces  its  sensitivity 
to  non-uniform  datasets. 

With  these  changes,  the  parallel  Delaunay  triangulation  algorithm  was  found  to  perform  about 
twice  as  many  floating-point  operations  as  Dwyer’s  algorithm  [16].  Furthermore,  the  cumulative 
floating-point  operation  count  was  found  to  increase  uniformly  with  recnusion  depth,  indicating 
that  the  algorithm  should  be  usable  as  a  partitioner  without  loss  of  efficiency. 

However,  as  implemented  in  Nesl,  the  algorithm  was  an  order  of  magnitude  slower  on  one 
processor  than  a  good  serial  algorithm.  It  was  (iiosen  as  a  test  case  for  Machiavelli  due  to  its 
recursive  divide-and-conquer  nature,  the  natural  match  of  the  partitioning  variant  to  MachiaveUi’s 
ability  to  use  efficient  serial  code,  and  its  nesting  of  a  recursive  convex  hull  algorithm  within  a 
recursive  Delaunay  triangulation  algorithm,  as  shown  in  Figure  1. 


4  Implementation 

This  section  describes  several  implementation  decisions  and  optimizations  that  affect  the  perfor¬ 
mance  of  the  final  program,  including  using  the  right  data  structures,  improving  the  performance 
of  specific  algorithmic  substeps,  and  using  a  general  optimization  that  (;an  be  appfied  to  many 
Machiavelli  programs.  Most  of  the  optimizations  relate  to  reducing  or  eliminating  inter  processor 
communication.  Further  analysis  can  be  found  in  Section  5. 

Data  structures  The  basic  data  structure  used  by  the  (X)de  is  a  pointy  represented  using  two 
double-precision  floating-point  values  for  the  x  and  y  coordinates,  and  two  integers,  one  serving 
as  a  unique  global  identifier  and  the  other  as  a  commimication  index  within  team  phases  of  the 
program.  Points  are  stored  in  vectors,  which  are  distributed  in  a  block  fashion  across  the  processors 
of  the  current  team.  A  border  is  composed  of  corner.%  each  of  whivh  represents  the  triplet  of  points 
corresponding  to  two  segments  in  a  path.  Corners  are  not  balanced  across  the  processors  as  points 
are,  but  rather  are  stored  on  the  same  processor  as  their  “middle”  point.  A  vector  of  indices  I 
links  the  points  in  P  with  the  corners  in  the  borders  B  and  H.  Given  these  data  structures,  the 
operatious  of  finding  internal  points,  and  projei'ting  points  onto  a  parabola  (see  Figure  3),  both 
reduce  to  simple  local  loops. 

Serial  triangulation  For  the  serial  base  case  we  use  the  optimized  version  of  Dwyer’s  algorithm 
that  is  provided  by  the  Triangle  mesh  generation  package  [33].  Since  the  input  format  for  Triangle 
differs  from  that  used  by  the  parallel  program,  conversion  steps  are  necessary  before  and  after 
calling  it.  These  translate  between  the  pointer-based  format  of  Triangle,  which  is  optimized  for 
serial  code,  and  the  indexed  format  with  replication  used  by  the  parallel  code.  No  changes  are 
necessary  to  the  source  code  of  Triangle. 
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Finding  the  median  Initially  a  parallel  version  of  quiekmedian  [25]  was  iLsed  to  find  the  median 
internal  point  along  the  x  or  y  axis.  Quickmedian  redistributes  data  amongst  the  processors  on 
each  recursive  step,  resulting  in  high  communication  overhead.  It  was  therefore  replaced  with  a 
median-of-medians  algorithm,  in  which  each  processor  first  uses  a  serial  quickmedian  to  compute 
the  median  of  its  local  data,  then  contributes  this  local  median  in  a  collective  communication 
step,  and  finally  computes  the  median  of  all  the  medians.  The  result  is  not  guaranteed  to  be  the 
exact  median,  but  in  practice  it  is  sufficiently  good  for  load-balancdng  purposes;  this  modification 
increased  the  speed  of  the  overall  Delaunay  triangulation  program  for  the  datasets  and  machine 
sizes  studied  (see  Section  5)  by  4-30%. 

Finding  the  lower  convex  hull  As  in  the  original  algorithm,  a  variant  of  quickhull  [30]  is  used 
to  find  the  convex  hull,  resulting  in  two  nested  recursive  algorithms  as  shown  in  Figure  1.  However, 
in  contrast  to  using  a  different  serial  Delaunay  triangulation  algorithm,  the  serial  code  of  quickhull 
implements  the  same  algorithm  as  the  parallel  code. 

The  basic  quickhull  algorithm  tends  to  pick  extreme  “pivot”  points  when  operating  on  non- 
uniform  point  distributions,  resulting  in  a  poor  division  of  data  and  a  (jonsequent  lack  of  progress. 
Chan  et  al  [9]  describe  a  variant  that  tests  the  slope  between  pairs  of  points  and  mses  pruning 
to  guarantee  that  recursive  calls  have  at  most  3/4  of  the  original  points.  However,  pairing  all  n 
points  and  finding  the  median  of  their  slopes  is  a  significant  addition  to  the  basic  cost  of  quickhull. 
Experimentally,  pairing  only  y/n  points  was  found  to  give  better  performance  when  used  as  a 
substep  of  the  Delaimay  triangulation  program  (see  Section  5.4  for  an  analysis).  As  with  the 
median-of-medians  approach,  the  global  effects  of  receiving  approximate  results  from  an  algorithm 
substep  are  more  than  offset  by  the  decrease  in  running  time  of  the  substep. 

Combining  results  The  quickhull  algorithm  concatenates  the  results  of  two  recursive  calls  before 
returning.  Using  Machiavelli  this  corresponds  to  merging  two  teams  of  processors  and  redistributing 
their  results  to  form  a  new  vector.  However,  since  this  is  the  last  operation  that  the  function 
performs,  the  intermediate  appends  in  the  parallel  call  tree  (and  their  associated  interprocessor 
communication  phases)  can  be  optimized  away.  They  are  replaced  with  a  single  call  to  Machiavelli 
at  the  top  level  that  redistributes  the  serial  result  from  each  processor  into  a  parallel  vector  shared 
by  all  the  processors.  This  general  Machiavelli  optimization  is  also  applied  to  merging  the  results 
of  the  Delaunay  triangulation  algorithm  itself. 

Creating  the  subproblems  The  border  merge  step  interse(^ts  the  current  border  B  with  the 
dividing  path  creating  two  new  borders  and  B^-  and  two  new  point  sets  and  P^\ 
This  requires  a  series  of  line  orientation  tests  to  decide  how  to  merge  corners.  To  eliminate  an 
interprocessor  communication  phase  in  this  step,  the  two  outer  points  represented  by  a  corner  are 
replicated  in  the  corner  structure.  All  the  information  required  for  the  line  orientation  tests  can 
thiLs  be  found  on  the  local  processor  (the  memory  cost  of  this  replication  is  analyzed  in  Section  5.3). 
Additionally,  although  Figure  3  shows  two  calls  to  the  border  merge  func^tion,  one  for  ea(di  direcjtion 
of  the  new  dividing  path,  in  practic:e  it  is  faster  to  make  a  single  pass,  (treating  both  new  borders 
and  point  sets  at  the  same  time. 

5  Experimental  Results 

The  goal  of  this  section  is  to  vaUdate  the  claims  of  portability,  good  parallel  efficiency,  and  the 
ability  to  handle  non-uniform  datasets.  We  aLso  analyze  where  the  bottlenecks  are,  the  reasons  for 
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Uniform  distribution.  Normal  distribution.  Kuzmin  distribution.  Line  singularity. 

Figure  4:  Examples  of  1000  points  in  each  of  the  four  test  distributions  (taken  from  [8];  for  clarity, 
the  Kuzmin  distribution  is  shown  zoomed  on  the  central  point).  Parallehzation  techniques  that 
assume  uniform  distributions,  such  as  bucketing,  suffer  from  poor  performance  on  the  Kuzmin  and 
line  distributions. 


any  lack  of  scalability,  and  the  effect  of  some  of  the  implementation  decisions  presented  in  Section  4 
on  both  running  time  and  memory  use. 

To  test  portability,  we  used  three  parallel  architectures:  a  loosely-coupled  workstation  cluster 
(DEC  AlphaCluster)  with  8  processors,  a  shared-memory  SGI  Power  Challenge  with  16  processors, 
a  distributed-memory  Cray  T3D  with  64  processors,  and  a  distributed-memory  IBM  SP2  with  16 
processors.  To  test  parallel  efficiency,  we  compared  timings  to  those  on  one  processor,  when  the 
program  immediately  switches  to  the  serial  Triangle  package  [33].  To  test  the  ability  to  handle 
non-uniform  datasets  we  used  four  different  distributions  taken  from  [8]: 


Uniform  distribution:  The  coordinates  x  and  y  are  chosen  at  random  within  the  unit  square. 

Normal  distribution:  The  coordinates  x  and  y  are  chosen  as  independent  samples  from  the 
normal  distribution. 


Kuzmin  distribution;  This  is  an  example  of  convergence  to  a  point,  and  is  used  by  astrophysicists 
to  model  the  distribution  of  star  clusters  within  galaxies.  It  is  radially  symmetric,  with  density 
falling  rapidly  with  distance  r  from  a  central  point.  The  accumulative  probability  function  is 


M(r)  =  1  - 


1 

\/l  +  7^ 


Line  singularity;  This  is  an  example  of  convergence  to  a  line,  resulting  in  a  distribution  that 
cannot  be  efficiently  parallelized  using  techniques  such  as  bucheting.  It  is  defined  using  a 
constant  b  (set  here  to  0.001)  and  a  transformation  from  a  imiform  distribution  (u,  w)  of 


(••K.y)  =  (: 


■hu  +  h’ 


Examples  of  these  distributions  are  shown  in  Figure  4.  All  timings  represent  the  average  of  five 
runs  iLsing  different  seeds  for  a  pseudo-random  number  generator.  For  a  given  problem  size  and 
seed  the  input  data  is  the  same  regardless  of  the  architecture  or  number  of  processors. 
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Processors 

SP2 

SGI 

T3D 

Alpha 

1 

8.42s 

(1.00) 

lC).48s 

(1.00) 

13.028 

(1.00) 

13.19s 

(1.00) 

5.42s 

(1.55) 

6.12s 

(1.71) 

8.05s 

(1.62) 

7.63s 

(1.73) 

3.31s 

(2.55) 

3.28s 

(3.20) 

4.25s 

(3.06) 

5.17s 

(2.55) 

2.19s 

(3.85) 

1.96s 

(5.36) 

2.53s 

(5.16) 

3.89s 

(3.39) 

Processors 

SP2 

SGI 

T3D 

Alpha 

1 

10.82s  (1.00) 

14.11s  (1.00) 

16.05s  (1.00) 

16.39s  (1.00) 

2 

6.04s  (1.79) 

6.91s  (2.04) 

8.13s  (1.97) 

7.93s  (2.07) 

4 

4.36s  (2.48) 

4.84s  (2.92) 

5.58s  (2.88) 

6.36s  (2.58) 

8 

2.52s  (4.30) 

2.56s  (5.51) 

2.96s  (5.43) 

4.36s  (3.76) 

Table  1:  Time  taken,  and  relative  speedup,  when  triangulating  128k  points  on  the  four  different 
platforms  tested.  The  times  shown  are  for  the  irregular  Kuzmin  (top)  and  line  singularity  (bottom) 
distributions. 

5.1  Analysis 

To  illiLStrate  the  algorithm’s  parallel  efficiency,  Figure  5  shows  the  time  to  triangulate  a  relatively 
small  problem  (128k  points)  on  different  numbers  of  processors,  for  eadi  of  the  three  platfornLS  and 
the  four  different  datasets.  Speedup  is  not  perfect  becaiLse  as  more  processors  are  added,  more  levels 
of  recursion  are  spent  in  parallel  code  rather  than  in  the  more  efficient  serial  code.  However,  we  still 
achieve  approximately  60%  parallel  efficiency  for  the  datasets  and  machine  sizes  tested — that  is, 
we  achieve  about  half  of  the  perfect  speedup  over  good  serial  (X)de.  Additionally,  the  Kuzmin  and 
line  distributions  show  similar  speedups  to  the  imiform  and  normal  distributions,  suggesting  that 
the  algorithm  is  effective  at  handling  non-uniform  datasets  as  well  as  imiform  ones.  Data  for  the 
Kuzmin  and  line  singularity  distributions  are  also  shown  in  Tible  1.  Note  that  the  Cray  T3D  and 
the  DEC  AlphaCliLster  mse  the  same  150MHz  Alpha  21064  processors,  and  their  single-processor 
times  are  thus  comparable.  However,  the  T3D’s  specialized  interconnection  network  has  lower 
latency  and  higher  bandwidth  than  the  commodity  FDDI  network  on  the  AlphaCluster,  resulting 
in  better  scalability. 

To  illustrate  scalability.  Figure  6  shows  the  time  to  triangulate  a  variety  of  problem  sizes  on 
different  numbers  of  processors.  For  clarity,  only  the  imiform  and  line  distributions  are  shown, 
since  these  take  the  least  and  most  time,  respectively.  Again,  per-processor  performance  degrades 
as  we  increase  the  number  of  processors  because  more  levels  of  recursion  are  spent  in  parallel  code. 
However,  for  a  fixed  number  of  processors  the  performance  scales  well  with  problem  size,  as  we 
would  expect  from  an  O(nlogn)  algorithm. 

To  illustrate  the  relative  costs  of  the  different  components  of  the  algorithm,  Figure  7a  shows 
the  accmnulated  time  per  substep  of  the  algorithm.  The  parallel  substeps  of  the  algorithm,  namely 
median,  convex  hull,  and  sphtting  and  forming  teams,  become  more  important  as  the  number  of 
processors  is  increased.  The  time  taken  to  convert  to  and  from  Triangle’s  data  format  is  insignificant 
by  comparison,  as  is  the  time  spent  in  the  complicated  but  purely  local  border  merge  step. 

Figure  7b  shows  the  same  data  from  a  different  view,  as  the  total  time  per  recursive  level  of  the 
algorithm.  This  clearly  shows  the  effect  of  the  extra  parallel  phases  as  the  number  of  processors  is 
increased. 

Finally,  Figure  8  iLses  a  parallel  time  line  to  show  the  activity  of  each  processor  when  triangulat¬ 
ing  a  line  singularity  dataset.  There  are  several  important  effects  that  can  be  seen  here.  First,  the 
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Figure  5:  Speedup  of  Delaunay  triangulation  program  for  four  input  distributions  and  four  parallel 
architectures.  The  graphs  show  the  time  to  triangulate  a  total  of  128k  points  as  the  number  of 
processors  is  varied.  Single  processor  results  are  for  good  serial  code  (Triangle  [33]).  Increasing 
the  nmnber  of  processors  results  in  more  levels  of  recmsion  being  spent  in  slower  parallel  code 
rather  than  faster  serial  code,  and  hence  the  speedup  is  not  linear.  The  effect  of  starting  with  an 
X  or  y  (nit  is  shown  in  the  alternately  poor  and  good  performance  on  the  highly  directional  line 
distribution.  IBM  SP2  results  are  for  thin  nodes,  using  xlc  -03  and  MPICH  1.0.12.  SGI  Power 
Challenge  results  are  for  R80D0  processors,  msing  cc  -02  and  SGI  MPI.  Cray  T3D  results  use  cc 
-02  and  MPICH  1.0.13.  DEC  AlphaCluster  results  are  for  DEC  3000/500  workstations  connecited 
by  an  FDDI  Gigaswitch,  using  cc  -02  and  MPICH  1.0.12. 
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Figure  6:  Scalability  of  Delaunay  triangulation  program  for  four  input  distributions  and  four  parallel 
architectures.  The  graphs  show  the  time  to  triangulate  16k- 128k  points  per  processor  as  the  number 
of  processors  is  varied  (for  clarity,  only  the  fastest  and  slowest  distributions  are  shown  for  a  given 
number  of  processors).  Machine  setups  are  as  in  Figure  5.  An  additional  graph  for  the  Cray  T3D 
shows  scalability  up  to  64  processors. 
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(a)  Total  time  in  each  snbstep  (b)  Time  in  each  recursive  level 

Figure  7:  Two  views  of  the  execution  time  as  the  problem  size  is  scaled  with  number  of  processors 
(IBM  SP2,  128k  points  per  processor),  (a)  shows  the  total  time  spent  in  each  substep  of  the 
algorithm.  The  time  spent  in  serial  code  remains  approximately  constant,  while  convex  hull  and 
team  operations  (which  includes  synchronization  delays)  are  the  major  overheads  in  the  parallel 
code,  (b)  shows  the  time  per  recursive  level  of  the  algorithm;  note  the  approximately  constant 
overhead  per  level. 


nested  recursion  of  the  convex  hull  algorithm  within  the  Delaunay  triangulation  algorithm.  Second, 
the  alternating  high  and  low  time  spent  in  the  convex  hull,  due  to  the  effect  of  the  alternating  x 
and  y  cuts  on  the  highly  directional  line  distribution.  Third,  the  operation  of  the  processor  teams. 
For  example,  two  teams  of  four  processors  split  into  four  teams  of  two  just  before  the  0.94  second 
mark,  and  further  subdivide  into  eight  teams  of  one  processor  (and  hence  switch  to  serial  code) 
just  after.  Lastly,  the  amount  of  time  wasted  waiting  for  the  slowest  processor  in  the  parallel  merge 
phase  at  the  end  of  the  algorithm  is  relatively  small,  despite  the  very  non-uniform  dataset. 


5.2  Performance  Prediction 

The  total  running  time  of  the  program  for  a  problem  of  size  n  on  P  processors  can  be  calculated 
as  the  sum  of  the  serial  and  parallel  components  per  processor.  The  serial  time  (that  is,  the  time 
spent  in  Triangle)  is  0((n  log  n)/P).  The  number  of  parallel  phases  is  equal  to  the  logarithm  of 
the  number  of  processors,  and  the  time  spent  in  each  phase  is  again  0((nlogn)/P),  so  that  the 
parallel  time  is  0((nlogn)(log  P)/P).  Combining  these  into  an  equation  to  predict  running  time, 
we  have: 

rr.,..  (nlog2«)(fcl+fc2riog2^1) 

T(n,P)  = - p - 

Note  that  this  simplified  formula  ignores  overheads  dependent  only  on  P,  since  for  problem  sizes 
of  interest  these  overheads  are  insignificant  compared  to  the  combined  effects  of  n  and  P  (in 
practical  terms,  the  time  taken  to  send  and  receive  large  amounts  of  data  far  outweighs  the  fixed 
latency  of  the  messages).  We  can  then  measiue  the  parameters  ki  and  k2  for  a  specific  distribution 
and  architecture  in  order  to  be  able  to  predict  the  nmning  time  for  a  given  problem  size  on  a 
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Figure  8:  Activity  of  eight  processors  over  time,  showing  the  parallel  and  serial  phases  of  Delaunay 
triangulation  and  its  inner  convex  hull  algorithm  (IBM  SP2,  128k  points  in  a  line  singularity 
distribiition).  A  parallel  step  level  of  two  phases  of  Delaimay  triangulation  code  smroimding  one 
or  more  convex  hull  phases;  this  rim  has  three  parallel  levels.  Despite  the  non-uniform  dataset  the 
processors  do  appr(}Kimately  the  same  amount  of  serial  work. 
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given  niunber  of  processors.  For  example,  for  a  uniform  distribution  on  the  SGI  Power  Challenge, 
ki  =  4.4  X  10”^  s.  and  =  6.0  x  10“^  s.  predicts  running  times  to  within  7%. 

5.3  Memory  Requirements 

As  explained  in  Section  4,  border  points  are  replicated  in  corner  structures  to  eliminate  the  need 
for  global  communication  in  the  border  merge  step.  An  obvious  objection  to  this  approach  is  that 
it  increases  the  memory  requirements  of  the  program.  Assmning  64-bit  doubles  and  32-bit  integers, 
a  point  (two  doubles  and  two  integers)  and  associated  index  vector  entry  (two  integers)  ocicmpies  32 
bytes,  while  a  corner  (two  replicated  points)  occupies  48  bytes.  However,  since  a  border  is  composed 
of  only  a  small  fraction  of  the  total  number  of  points,  the  additional  memory  required  to  hold  the 
(X)rners  is  relatively  small.  For  example,  in  a  nm  of  512k  points  in  a  line  singularity  distribution  on 
eight  processors,  the  maximmn  ratio  of  corners  to  total  points  on  a  processor  (which  occurs  at  the 
swit<h  between  parallel  and  serial  code)  is  approximately  2,000  to  67,000,  so  that  the  corners  o(x:upy 
less  than  5%  of  required  storage.  Extreme  cases  can  be  manufactmed  by  reducing  the  nimiber  of 
points  per  processor;  for  example,  with  128k  points  the  maximum  ratio  is  apprcjximately  2,000  to 
17,500.  Even  here,  however  the  corners  still  represent  less  than  15%  of  required  storage,  and  by 
reducing  the  niunber  of  points  per  processor  we  have  also  reduced  absolute  memory  requirements. 


5,4  Performance  of  Convex  Hull  Variants 

Finally,  we  investigate  the  performance  of  the  convex  hull  variants  described  in  Section  4.  The  final 
y^-pair  pruning  quickhull  was  benchmarked  against  both  a  basic  quickhull  and  the  original  n-pair 
pruning  quickhull  by  Chan  et  al  [9].  Results  for  an  extreme  case  are  shown  in  Figure  9.  As  can 
be  seen,  the  n-pair  algorithm  is  more  than  twice  as  fast  as  the  basic  quickhull  on  the  non-uniform 
Kuzmin  dataset  (over  all  the  datasets  and  machine  sizes  tested  it  was  a  factor  of  1.03-2.83  faster). 
The  -^/n-pair  algorithm  provides  a  modest  additional  improvement,  being  a  factor  of  1.02-1.30 
faster  than  the  n-pair  algorithm. 
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Figure  9:  Effect  of  different  convex  hull  functions  on  time  to  triangulate  128k  points  on  an  8- 
processor  IBM  SP2.  The  pruning  quickhull  due  to  Chan  et  al  [9]  has  much  better  performance 
than  the  basic  algorithm  on  the  non-uniform  Kuzmin  distribution;  reducing  its  sampling  accuracy 
produces  a  modest  additional  improvement. 


6  Conclusions 

This  paper  has  described  the  use  of  the  Ma(iiiaveUi  toolkit  to  produce  a  fast  and  practical  parallel 
two-dimensional  Delaunay  triangulation  algorithm.  The  code  was  derived  from  a  combination  of 
a  theoretically  efRcaent  CREW  PRAM  parallel  algorithm  and  existing  optimized  serial  code.  The 
resulting  program  has  three  advantages  over  existing  work.  First,  it  is  widely  portable  due  to 
its  use  of  MPI;  it  aciiieves  similar  speedups  on  three  machines  with  very  different  communication 
architetJtures.  Second,  it  can  handle  datasets  that  do  not  have  a  uniform  distribution  of  points  with 
a  relatively  small  impact  on  performance.  Specifically,  it  is  at  most  1.5  times  slower  on  non-uniform 
datasets  than  on  imiform  datasets,  whereas  previoiLs  implementations  have  been  up  to  10  times 
slower.  Finally,  it  has  good  absolute  performance,  achieving  a  parallel  efficiency  (that  is,  percentage 
of  perfect  speedup  over  good  serial  code)  of  greater  than  50%  on  machine  sizes  of  interest.  Previous 
implementations  have  achieved  at  most  17%  parallel  efficiency. 

In  addition  to  these  specific  results,  there  are  two  more  general  conclusions  that  are  well-known 
but  bear  repeating.  First,  constants  matter:  simple  algorithms  are  often  faster  than  more  complex 
algorithms  that  have  a  lower  complexity  (as  shown  in  the  case  of  quickhull).  Second,  quick  but 
approximate  algorithmic  substeps  often  result  in  better  overall  performance  than  slower  approaches 
that  make  guarantees  about  the  quafity  of  their  results  (as  shown  in  the  case  of  the  median-of- 
medians  and  -^n-pair  priming  quickhull). 
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